In my previous notebook, I showed you how to read a Git repository directly into Python with Pandas and GitPython. As much as I like that approach (because everything is in one place and therefore reproducible), it's currently very slow when reading all the statistics information (but I'll work on that!). What I want now is a really fast method for reading a complete Git repository.
I'll take this opportunity to show you how to read almost any kind of structured, linear data into a Pandas DataFrame. The general rule of thumb is: as long as you can see a pattern in the raw data, Pandas can read and tame it, too!
In this notebook, we analyze the Git repository of aim42 (an open book project about how to improve legacy systems).
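The repository's history was exported into a log file up front. Here is a minimal sketch of how such a file can be produced (the clone location path/to/aim42 is a placeholder); the --numstat flag is what generates the per-file statistics rows we'll see below:

import subprocess

# Sketch: export the Git history into the log format used below.
# --numstat adds one insertions/deletions/filename row per modified file.
with open('data/gitlog_aim42.log', 'wb') as out:
    subprocess.run(
        ['git', 'log', '--numstat', '--pretty=format:--%h--%ad--%aN'],
        cwd='path/to/aim42',  # placeholder: point this at your local clone
        stdout=out,
        check=True)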
The first entries of that file look something like this:
In [1]:
with open(r'data/gitlog_aim42.log') as log:
    for line in log.readlines()[:8]:
        print(line, end='')
For each commit, we choose to create a header line with the following commit info (by using --pretty=format:'--%h--%ad--%aN', where %h is the abbreviated commit hash, %ad the author date and %aN the author's name):
--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste

It contains the SHA key, the timestamp as well as the author's name of the commit, separated by --. Every other row contains some statistics about a modified file:
2 0 src/main/asciidoc/appendices/bibliography.adoc
It contains the number of lines inserted, the number of lines deleted and the relative path of the file. With a little trick and a little bit of data wrangling, we can read that information into a nicely structured DataFrame.
Let's get started!
First, I'll show you my approach for reading nearly everything into a DataFrame. The key is to use Pandas' read_csv for reading "non-character separated values". How does that work? We simply choose a separator that doesn't occur in the file we want to read. My favorite character for this is the "DEVICE CONTROL TWO" character U+0012: I have yet to encounter a data set that contains it.
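To show that this really works for anything line-shaped, here is a minimal, self-contained sketch with made-up input:

import io
import pandas as pd

# Made-up input full of "real" separators like commas, tabs and semicolons.
raw = "first,line\twith,separators\nsecond line\nthird;line\n"

# "\u0012" doesn't occur in the data, so each complete line lands in one column.
df = pd.read_csv(io.StringIO(raw), sep="\u0012", header=None, names=['raw'])
print(df)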
Back to the real data: we read our Git log file without any headers (because there are none) and give the only column a nice name.
In [2]:
import pandas as pd
commits = pd.read_csv("data\gitlog_aim42.log",
sep="\u0012",
header=None,
names=['raw'])
commits.head()
Out[2]:
OK, but now we have a data wrangling challenge: the commit info and the statistics for the modified files sit in the same column, although they don't belong together. What we want is the commit info along with the file statistics in separate columns, so we can get some serious analysis started. First, let's grab all the rows that start with our -- marker:
In [3]:
commit_marker = commits[
    commits['raw'].str.startswith("--", na=False)]
commit_marker.head()
Out[3]:
With this, we can focus on extracting the information from a commit info row. The next command might look a little frightening, but don't worry, we'll go through it step by step.
In [4]:
commit_info = commit_marker['raw'].str.extract(
    r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$",
    expand=True)
commit_info['date'] = pd.to_datetime(commit_info['date'])
commit_info.head()
Out[4]:
We want to extract some data from the raw column. For this, we use the extract method on the string representation (note the str) of all the rows. This method expects a regular expression, so we provide our own:

^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$

It works as follows: ^-- matches the marker at the beginning of a row, each (?P<name>.*?) group lazily captures the text up to the next -- separator, and $ anchors the match at the end of the row. I use these ugly-looking, named match groups because Pandas uses the name of such a group as the name of the corresponding column (therefore we avoid renaming the columns later on).
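If you want to convince yourself of what the named groups capture, here is a quick sanity check with Python's standard re module, using the sample row from above:

import re

pattern = r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$"
sample = "--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste"
# Each named group becomes a key in the resulting dict.
print(re.match(pattern, sample).groupdict())
# {'sha': 'fa1ca6f', 'date': 'Thu Dec 22 08:04:18 2016 +0100', 'author': 'feststelltaste'}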
The expand=True keyword delivers a DataFrame with columns for each detected regex group.
We simply store the result into a new DataFrame variable commit_info.
Because we've worked with the string representation of the rows, Pandas didn't infer the right data types for our newly created columns. That's why we need to cast the date column to the right type.
OK, this part is ready, let's have a look at the file statistics!
Every row that is not a commit info row is a file statistics row. So we just reuse the index of our already prepared commit_info DataFrame to get all the other data, effectively saying "give me all rows that are not in the index of the commit_info DataFrame".
In [5]:
file_stats_marker = commits[
    ~commits.index.isin(commit_info.index)]
file_stats_marker.head()
Out[5]:
Luckily, each row's data is just a tab-separated string that we can easily split with the split method. We expand the result to get a DataFrame, rename the default columns to names that make more sense, and adjust some data types. For the latter, we use the keyword errors='coerce', which makes to_numeric return NaN for every entry that is not a number.
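Why do we need coerce at all? For binary files, Git's --numstat output contains a "-" instead of numbers. Here is a tiny sketch of what to_numeric does with such values:

import pandas as pd

# git --numstat reports "-" for binary files; coerce turns those into NaN.
print(pd.to_numeric(pd.Series(['2', '0', '-']), errors='coerce'))
# 0    2.0
# 1    0.0
# 2    NaN
# dtype: float64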
In [6]:
file_stats = file_stats_marker['raw'].str.split(
    "\t", expand=True)
file_stats = file_stats.rename(
    columns={0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(
    file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(
    file_stats['deletions'], errors='coerce')
file_stats.head()
Out[6]:
Now we have three parts: all commits, the separated commit info and the file statistics.
We only need to glue the commit info and the file statistics together into a normalized DataFrame. For this, we have to make some adjustments to the indexes.
For the commit info, we want each commit's entry repeated for each of its file statistics rows. That means we reindex the commit info by using the index of the complete commits DataFrame...
In [7]:
commit_info.reindex(commits.index).head(3)
Out[7]:
...and fill in the missing values on the file statistics' rows to get the needed structure. Together, this looks like the following:
In [8]:
commit_data = commit_info.reindex(
    commits.index).fillna(method="ffill")
commit_data.head()
Out[8]:
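In case the mechanics seem opaque, here is the same reindex-and-forward-fill pattern as a tiny, self-contained sketch with made-up data:

import pandas as pd

# Two "commit info" entries sit at the marker positions 0 and 3...
info = pd.Series(['commit A', 'commit B'], index=[0, 3])
# ...and get repeated over the rows in between by reindexing and forward-filling.
print(info.reindex(range(6)).fillna(method='ffill'))
# 0    commit A
# 1    commit A
# 2    commit A
# 3    commit B
# 4    commit B
# 5    commit B
# dtype: object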
After filling the file statistics rows, we can throw away the dedicated commit info rows by reusing the index from above (look at the index to see this clearly).
In [9]:
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data.head()
Out[9]:
The easy step afterward is to join the file_stats DataFrame with the commit_data.
In [10]:
commit_data = commit_data.join(file_stats)
commit_data.head()
Out[10]:
We're done! Here are all the steps in a single cell, timed with %%time:
In [11]:
%%time
import pandas as pd
commits = pd.read_csv(r'data/gitlog_aim42.log', sep="\u0012", header=None, names=['raw'])
commit_marker = commits[commits['raw'].str.startswith("--",na=False)]
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$", expand=True)
commit_info['date'] = pd.to_datetime(commit_info['date'])
file_stats_marker = commits[~commits.index.isin(commit_info.index)]
file_stats = file_stats_marker['raw'].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(file_stats['deletions'], errors='coerce')
commit_data = commit_info.reindex(commits.index).fillna(method="ffill")
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data = commit_data.join(file_stats)
It runs through in just a few milliseconds, not bad!
In this notebook, I showed you how to read imperfectly structured data via the non-character separator trick. I also showed you how to transform rows that contain multiple kinds of data into one nicely structured DataFrame.
Now that we have the Git repository in a DataFrame, we can do some nice things with it, e.g. visualizing the code churn of a project. But that's a story for another notebook! To give you a short preview:
In [12]:
%matplotlib inline
timed_commits = commit_data.set_index(pd.DatetimeIndex(commit_data['date']))[['insertions', 'deletions']].resample('1D').sum()
(timed_commits['insertions'] - timed_commits['deletions']).cumsum().fillna(method='ffill').plot()
Out[12]:
Stay tuned!